30 research outputs found

    Concept Embedding for Information Retrieval

    Full text link
Concepts are used to address the term-mismatch problem, but doing so requires an effective similarity measure between concepts. Word embedding offers a promising solution. In this study we present three approaches to building concept vectors from word vectors, and we use a vector-based measure to estimate inter-concept similarity. Our experiments show promising results. Furthermore, words and concepts become directly comparable, which could be used to improve the conceptual indexing process.
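The abstract does not spell out the three approaches, but one common way to derive a concept vector from word vectors is to average the vectors of the words in the concept's label, then compare concepts (or a concept against a word) with cosine similarity. The sketch below illustrates that idea with toy vectors; the vocabulary and the averaging strategy are assumptions, not the paper's method.

```python
import numpy as np

# Toy word vectors; in practice these would come from a trained
# embedding model such as word2vec or GloVe.
word_vectors = {
    "heart":   np.array([0.9, 0.1, 0.3]),
    "attack":  np.array([0.2, 0.8, 0.5]),
    "cardiac": np.array([0.8, 0.2, 0.4]),
    "arrest":  np.array([0.3, 0.7, 0.6]),
}

def concept_vector(label_words):
    """Build a concept vector by averaging the vectors of its label words."""
    vecs = [word_vectors[w] for w in label_words if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    """Vector-based inter-concept similarity measure."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

c1 = concept_vector(["heart", "attack"])
c2 = concept_vector(["cardiac", "arrest"])
print(cosine(c1, c2))
```

Because concept vectors live in the same space as word vectors, `cosine` can also compare a concept directly against a single word, which is what makes words and concepts comparable.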

    A New Lattice-Based Information Retrieval Theory

    No full text
Logic-based Information Retrieval (IR) models represent the retrieval decision as an implication d → q between a document d and a query q, where d and q are logical sentences. However, d → q is a binary decision; we therefore need a measure to estimate the degree to which d implies q, noted P(d → q). The main problems in logic-based IR models are the difficulty of implementing the decision algorithms and of defining the uncertainty measure P as part of the logic. In this study, we choose Propositional Logic (PL) as the underlying framework and propose to replace the implication d → q by the material implication d ⊃ q. There is a known mapping between PL and lattice theory, and Knuth [13] introduced the notion of degree of inclusion to quantify the ordering relations defined on lattices. We therefore position documents and queries on a lattice whose ordering relation is equivalent to the material implication. The implication d → q is then replaced by an ordering relation between documents and queries, and the uncertainty P(d → q) is redefined using the degree-of-inclusion measure. This new IR model is: (1) general, since most classical IR models can be instantiated from our lattice-based model; (2) able to formally prove van Rijsbergen's intuition about replacing P(d → q) by P(q|d); and (3) easy to implement.
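For the simplest case, where a document and a query are conjunctions of (negation-free) index terms, the lattice ordering reduces to set inclusion, and a natural set-based stand-in for the degree of inclusion mirrors P(q|d). The sketch below illustrates that reading; the exact definition of the measure in the paper may differ, so treat `degree_of_inclusion` as an assumption.

```python
def implies(d, q):
    """For conjunctive, negation-free sentences, the material implication
    d ⊃ q is valid exactly when every term of q is already a term of d."""
    return set(q) <= set(d)

def degree_of_inclusion(d, q):
    """A set-based stand-in for a Knuth-style degree of inclusion: the
    fraction of q's terms covered by d. Equals 1.0 exactly when d ⊃ q
    holds, and degrades gracefully when the ordering relation fails."""
    d, q = set(d), set(q)
    if not q:
        return 1.0
    return len(d & q) / len(q)

doc = ["lattice", "logic", "retrieval", "model"]
query = ["logic", "retrieval"]
print(implies(doc, query), degree_of_inclusion(doc, query))
```

The binary decision (`implies`) and its graded refinement (`degree_of_inclusion`) coincide at the top of the scale, which is exactly the relationship between d → q and P(d → q) described in the abstract.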

    Word Embedding for Social Book Suggestion

    No full text
This paper presents the joint work of the Universities of Grenoble and Saint-Étienne at the CLEF 2016 Social Book Search Suggestion Track. The approaches studied are based on personalization, taking the user's profile into account in the ranking process. The profile is filtered using Word Embedding, and we propose several ways to handle the relationships generated between terms. We find that tackling queries that are only "non-topical" is a great challenge in this case. The official results show that Word Embedding methods are able to improve results in the SBS case.
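One plausible reading of "filtering the profile using Word Embedding" is to keep only those profile terms whose embeddings are close to some query term, so that personalization does not drown out the query topic. The sketch below shows that idea with toy embeddings; the term vocabulary, threshold, and function names are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

# Toy embeddings standing in for a trained model.
emb = {
    "fantasy": np.array([1.0, 0.1]),
    "dragons": np.array([0.9, 0.2]),
    "cooking": np.array([0.1, 1.0]),
    "recipes": np.array([0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_profile(profile_terms, query_terms, threshold=0.8):
    """Keep only profile terms whose embedding is close to some query term."""
    kept = []
    for p in profile_terms:
        if any(cosine(emb[p], emb[q]) >= threshold
               for q in query_terms if q in emb and p in emb):
            kept.append(p)
    return kept

print(filter_profile(["dragons", "recipes"], ["fantasy"]))
```

With a "non-topical" query (e.g. one expressing only a mood or reading context), few profile terms clear the threshold, which matches the difficulty the abstract reports for such queries.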

    MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach

    Get PDF
Lab ImageCLEF: Cross Language Image Retrieval: Medical Image Classification and Retrieval. Conference website: http://clef2012.org. The MRIM research group participated in two tasks (ad-hoc image-based retrieval and case-based retrieval) of the ImageCLEF2012 Medical Retrieval track. In our contribution, we study the frequency-shift problem that arises when concepts, rather than words, are used as indexing terms. The main goal of our experiments is to check the validity of our new counting strategy for concepts (Relative Count), proposed as a solution to the frequency-shift problem. To validate the new strategy, we compare the retrieval performance (measured by MAP) of several classical IR models under the classical counting strategy (count each concept as 1) with their performance under the new strategy. The results are promising: the new counting strategy yields a considerable gain in performance. Our experiments use two supplementary resources: MetaMap as a text-to-concept mapping tool, and UMLS as an external resource containing the concepts.
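The abstract does not give the Relative Count formula, but a natural reading of the frequency-shift problem is that a mapping tool such as MetaMap can return several candidate concepts for a single text span, so counting each as 1 inflates concept frequencies relative to word frequencies. The sketch below contrasts classical counting with one plausible relative scheme that shares the span's single occurrence among its candidates; the formula and the UMLS CUIs shown are illustrative assumptions.

```python
from collections import defaultdict

def count_concepts(mapped_spans):
    """mapped_spans: list of lists; each inner list holds the candidate
    concepts a mapping tool returned for one text span. Classical counting
    adds 1 per concept; relative counting divides the span's single
    occurrence among its candidate concepts."""
    classical = defaultdict(float)
    relative = defaultdict(float)
    for concepts in mapped_spans:
        for c in concepts:
            classical[c] += 1.0
            relative[c] += 1.0 / len(concepts)
    return dict(classical), dict(relative)

# One unambiguous span and one span with two candidate concepts
# (hypothetical UMLS CUIs, for illustration only).
spans = [["C0018787"], ["C0018787", "C0027051"]]
classical, relative = count_concepts(spans)
print(classical, relative)
```

Under classical counting the two spans produce three concept occurrences; under the relative scheme the total mass stays at two, one per span, which is the kind of correction a "relative count" suggests.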

    Using social media images for building function classification

    Get PDF
Urban land use at the level of individual buildings is crucial geo-information for many applications, yet challenging to obtain. Street-level images are highly suited to predicting building functions, as building façades provide clear hints. Social media image platforms contain billions of images, including but not limited to street perspectives. To cope with this, this study proposes a filtering pipeline that yields high-quality, ground-level imagery from large-scale social media image datasets. The pipeline ensures that all resulting images have complete and valid geotags with a compass direction, so that image content can be related to spatial objects. We analyze our method on a culturally diverse social media dataset from Flickr with more than 28 million images from 42 cities worldwide. The obtained dataset is then evaluated on a building function classification task with three classes: commercial, residential, and other. Fine-tuned state-of-the-art architectures yield F1 scores of up to 0.51 on the filtered images. Our analysis shows that the quality of the labels from OpenStreetMap limits performance: human-validated labels increase the F1 score by 0.2. We therefore consider these labels weak, and publish the resulting images from our pipeline, together with the depicted buildings, as a weakly labeled dataset.
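The core metadata requirement of the pipeline, a complete and valid geotag plus a compass direction, can be sketched as a single predicate over a photo's metadata. The field names below are hypothetical (Flickr's actual API fields differ), and the real pipeline involves further content-based filtering stages this sketch omits.

```python
def valid_ground_level(photo):
    """Keep a photo only if it carries a complete, plausible geotag and a
    compass direction, so image content can be related to spatial objects.
    `photo` is a dict with hypothetical metadata field names."""
    lat, lon = photo.get("lat"), photo.get("lon")
    heading = photo.get("compass_heading")
    if lat is None or lon is None or heading is None:
        return False
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return False
    return 0.0 <= heading < 360.0

photos = [
    {"lat": 48.137, "lon": 11.575, "compass_heading": 90.0},
    {"lat": 48.137, "lon": 11.575},                          # no heading
    {"lat": 123.0, "lon": 11.575, "compass_heading": 10.0},  # bad latitude
]
print([valid_ground_level(p) for p in photos])
```

The compass heading is what lets a downstream step cast a ray from the camera position and decide which building façade the image actually shows.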

    Geo-Information Harvesting from Social Media Data

    Get PDF
As unconventional sources of geo-information, massive imagery and text messages from open platforms and social media form a temporally quasi-seamless, spatially multi-perspective stream, but one of unknown and diverse quality. Owing to its complementarity to remote sensing data, geo-information from these sources offers promising perspectives, but harvesting it is not trivial because of its data characteristics. In this article, we address key aspects of the field, including data availability, analysis-ready data preparation and data management, geo-information extraction from social media text messages and images, and the fusion of social media and remote sensing data. We then showcase some exemplary geographic applications. In addition, we present the first extensive discussion of ethical considerations of social media data in the context of geo-information harvesting and geographic applications. With this effort, we wish to stimulate curiosity and lay the groundwork for researchers who intend to explore social media data for geo-applications. We encourage the community to join forces by sharing their code and data. (Accepted for publication in IEEE Geoscience and Remote Sensing Magazine.)

    Conceptual Structure Matching using a Bayesian Framework in a Conceptual Indexing. Application to Medical Domain with Multilingual Documents and UMLS Meta-thesaurus

    No full text
Information Retrieval systems that compute the matching between a document and a query based on word intersection cannot reach relevant documents that do not share any term with the query. The objective of this master's thesis is to propose a solution to this problem in the context of conceptual indexing. We study an ontology-based matching that exploits the links between concepts, and we propose a model that exploits the weighted links of an ontology. We also propose to extend the links of the ontology to reflect the structural ambiguity of some concepts. Our proposal is validated on the ImageCLEFmed 2005 test collection with UMLS 2005 as the external resource.
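A minimal version of matching through weighted ontology links: a query concept that does not appear in the document can still contribute through the strongest weighted link connecting it to a document concept. This sketch is an assumption about the general mechanism, not the thesis's Bayesian formulation, and the concept names and weights are invented for illustration.

```python
def conceptual_score(doc_concepts, query_concepts, links):
    """Score a document against a query using weighted ontology links:
    a query concept contributes 1.0 on an exact match, and otherwise the
    best link weight connecting it to some document concept.
    `links` maps (concept_a, concept_b) -> weight in [0, 1]."""
    score = 0.0
    for q in query_concepts:
        if q in doc_concepts:
            score += 1.0
        else:
            score += max((links.get((q, d), links.get((d, q), 0.0))
                          for d in doc_concepts), default=0.0)
    return score

links = {("heart_attack", "cardiac_arrest"): 0.7}  # hypothetical weight
print(conceptual_score({"cardiac_arrest"}, ["heart_attack", "aspirin"], links))
```

A document sharing no term with the query can thus still score above zero, which is precisely the term-mismatch problem the thesis sets out to solve.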

Modeling Information Retrieval with Logic and Lattices: An Application to Conceptual Information Retrieval

    No full text
This thesis is situated in the context of logic-based Information Retrieval (IR) models. The work presented here is mainly motivated by the inadequacy of the term-independence assumption, which is widely accepted in IR even though terms are normally related, and by the inferential nature of the relevance-judgment process. Since formal logics are well adapted to knowledge representation, and hence to representing relations between terms, and since formal logics are also powerful systems for inference, logic-based IR is a candidate line of work for building effective IR systems. However, a study of current logic-based IR models shows that they generally share some shortcomings. First, they usually propose complex representations of documents and queries that are hard to obtain automatically. Second, the retrieval decision d → q, which represents the matching between a document d and a query q, can be difficult to verify. Finally, the uncertainty measure U(d → q) is either ad hoc or hard to implement. In this thesis, we propose a new logic-based IR model that overcomes most of these limits. We use Propositional Logic (PL) as the underlying logical framework and represent documents and queries as logical sentences written in Disjunctive Normal Form. We argue that the retrieval decision d → q can be replaced by the validity of the material implication. We then exploit the relation between PL and lattice theory to check whether d → q is valid: we first propose an intermediate representation of logical sentences in which they become nodes of a lattice whose partial-order relation is equivalent to the validity of the material implication. Accordingly, we transform the check of the validity of d → q, a computationally intensive task, into a series of simple set-inclusion checks.
To measure the uncertainty of the retrieval decision U(d → q), we use the degree-of-inclusion function Z, which can quantify partial-order relations defined on lattices. Our model works efficiently on any logical sentence, without restrictions, and is applicable to large-scale data. It also yields some theoretical conclusions, including formalizing and showing the adequacy of van Rijsbergen's assumption that the logical uncertainty U(d → q) can be estimated through the conditional probability P(q|d), redefining the two notions of Exhaustivity and Specificity, and the possibility of reproducing most classical IR models as instances of our model. We build three operational instances of our model: one to study the importance of Exhaustivity and Specificity, and two others to show the inadequacy of the term-independence assumption. Our experimental results show a worthwhile gain in performance when Exhaustivity and Specificity are integrated into one concrete IR model. However, the results of using semantic relations between terms were not sufficient to draw clear conclusions; in contrast, experiments exploiting structural relations between terms were promising. The work presented in this thesis can be extended either with further experiments, especially on the use of relations, or with a more in-depth theoretical study, especially of the properties of the Z function.
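For negation-free sentences, the "series of simple set-inclusion checks" has a compact reading: a DNF sentence is a set of clauses, each a set of atoms, and d ⊃ q is valid when every clause of d contains some clause of q. The sketch below illustrates this, together with a simple graded stand-in for the Z function; the thesis's actual Z may be defined differently, so treat `z` as an assumption.

```python
def dnf(*clauses):
    """A sentence in Disjunctive Normal Form: a set of clauses, each a
    frozenset of (here, negation-free) propositional atoms / index terms."""
    return {frozenset(c) for c in clauses}

def implies(d, q):
    """For negation-free DNF, the material implication d ⊃ q is valid iff
    every clause of d contains some clause of q: a series of simple
    set-inclusion checks instead of full propositional reasoning."""
    return all(any(qc <= dc for qc in q) for dc in d)

def z(d, q):
    """A simple graded stand-in for the degree-of-inclusion function Z:
    the fraction of d's clauses that satisfy q (1.0 iff d ⊃ q is valid)."""
    return sum(any(qc <= dc for qc in q) for dc in d) / len(d)

d = dnf({"logic", "lattice", "retrieval"}, {"logic", "probability"})
q = dnf({"logic"})
print(implies(d, q), z(d, q))
```

Each set-inclusion check runs in time linear in the clause sizes, which is what makes the approach applicable to large-scale data despite the general intractability of propositional validity.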